22 research outputs found
Self-imitating Feedback Generation Using GAN for Computer-Assisted Pronunciation Training
Self-imitating feedback is an effective and learner-friendly method for
non-native learners in Computer-Assisted Pronunciation Training. Acoustic
characteristics of native utterances are extracted, transplanted onto the
learner's own speech input, and given back to the learner as corrective
feedback. Previous work focused on speech conversion using prosodic
transplantation techniques based on the PSOLA algorithm. Motivated by the
visual differences found in spectrograms of native and non-native speech, we
investigated applying a GAN to generate self-imitating feedback, exploiting
the generator's mapping ability learned through adversarial training. Because
this mapping is highly under-constrained, we also adopt a cycle consistency
loss to encourage the output to preserve the global structure shared by native
and non-native utterances. Trained on 97,200 spectrogram images of short
utterances produced by native and non-native speakers of Korean, the generator
successfully transforms non-native spectrogram input into a spectrogram with
the properties of self-imitating feedback. Furthermore, the transformed
spectrogram shows segmental corrections that cannot be obtained by prosodic
transplantation. A perceptual test comparing the self-imitating and correcting
abilities of our method against the PSOLA baseline shows that the generative
approach with cycle consistency loss is promising.
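As a rough illustration of the cycle-consistency idea described above, the following sketch computes the two reconstruction terms of the loss. The generators `G` and `F` here are toy stand-ins for the trained networks, and the vectors stand in for spectrogram frames; none of these names come from the paper.

```python
# Hypothetical sketch of the cycle-consistency objective used to constrain
# the spectrogram-to-spectrogram mapping: G maps non-native -> native-like,
# F maps back. The toy generators below are stand-ins for trained networks.

def l1_distance(a, b):
    """Mean absolute difference between two flat feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(G, F, x_nonnative, y_native):
    """L_cyc = |F(G(x)) - x| + |G(F(y)) - y| (an expectation over a
    batch in practice; a single sample is shown here)."""
    forward_cycle = l1_distance(F(G(x_nonnative)), x_nonnative)
    backward_cycle = l1_distance(G(F(y_native)), y_native)
    return forward_cycle + backward_cycle

# Toy generators: small shifts standing in for learned transformations.
G = lambda v: [u + 0.1 for u in v]
F = lambda v: [u - 0.1 for u in v]

x = [0.2, 0.5, 0.9]   # "non-native" spectrogram frame
y = [0.3, 0.6, 0.8]   # "native" spectrogram frame
loss = cycle_consistency_loss(G, F, x, y)
print(round(loss, 6))  # 0.0: F undoes G exactly in this toy setup
```

Driving this loss toward zero is what encourages the generator to preserve the global spectrogram structure while changing only the properties that differ between native and non-native speech.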
Automatic Severity Assessment of Dysarthric speech by using Self-supervised Model with Multi-task Learning
Automatic assessment of dysarthric speech is essential for sustained
treatment and rehabilitation. However, obtaining atypical speech is
challenging, often leading to data scarcity. To tackle this problem, we
propose a novel automatic severity assessment method for dysarthric speech
that uses a self-supervised model in conjunction with multi-task learning.
Wav2vec 2.0 XLS-R is jointly trained on two tasks: severity level
classification and auxiliary automatic speech recognition (ASR). For the
baseline experiments, we employ hand-crafted features such as eGeMAPS and
linguistic features with SVM, MLP, and XGBoost classifiers. Evaluated on the
Korean dysarthric speech QoLT database, our model outperforms the traditional
baselines, with a 4.79% relative improvement in classification accuracy. In
addition, the proposed model surpasses the model trained without the ASR
head by a 10.09% relative improvement. Furthermore, we show how multi-task
learning affects severity classification performance by analyzing the latent
representations and the regularization effect.
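The joint-training setup above can be pictured as a weighted sum of the two task losses. The sketch below is a hedged illustration, not the paper's training code: plain cross-entropy stands in for the ASR objective (CTC in practice), and the weight `alpha` is a hypothetical knob.

```python
import math

# Hedged sketch of a multi-task objective: a weighted combination of the
# severity-classification loss and an auxiliary ASR loss. All values and
# the weighting scheme are illustrative assumptions, not the paper's.

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class,
    computed with the max-shift trick for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def multitask_loss(severity_logits, severity_label,
                   asr_loss_value, alpha=0.3):
    """total = (1 - alpha) * L_severity + alpha * L_asr."""
    l_sev = cross_entropy(severity_logits, severity_label)
    return (1 - alpha) * l_sev + alpha * asr_loss_value

# One toy step: 3 severity classes, a precomputed ASR loss value.
loss = multitask_loss([2.0, 0.5, -1.0], 0, asr_loss_value=1.2)
print(round(loss, 4))
```

Because the gradients of both terms flow into the shared Wav2vec 2.0 encoder, the auxiliary ASR head can act as a regularizer on the representations used for severity classification, which is the effect the abstract analyzes.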
Speech Intelligibility Assessment of Dysarthric Speech by using Goodness of Pronunciation with Uncertainty Quantification
This paper proposes an improved Goodness of Pronunciation (GoP) measure that
utilizes Uncertainty Quantification (UQ) for automatic speech intelligibility
assessment of dysarthric speech. Current GoP methods rely heavily on
overconfident neural-network predictions, which makes them unsuitable for
assessing dysarthric speech, given its significant acoustic differences from
healthy speech. To alleviate this problem, UQ techniques were applied to GoP
by 1) normalizing the phoneme prediction (entropy, margin, maxlogit,
logit-margin) and 2) modifying the scoring function (scaling, prior
normalization). As a result, prior-normalized maxlogit GoP achieves the best
performance, with relative increases of 5.66%, 3.91%, and 23.65% over the
baseline GoP for English, Korean, and Tamil, respectively. Furthermore, a
phoneme analysis is conducted to identify which phoneme scores correlate
significantly with intelligibility scores in each language.
Comment: Accepted to Interspeech 202
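To make the contrast between posterior-based GoP and the logit-based variants concrete, here is an illustrative sketch (not the paper's exact formulas) of a baseline softmax-posterior GoP score and two of the normalization styles named above, maxlogit and margin, computed from one frame of toy acoustic-model logits:

```python
import math

# Illustrative phoneme-level scores from acoustic-model logits. `frame`
# holds one frame's logits over a toy 4-phoneme inventory; all numbers
# are made up for demonstration.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def gop_posterior(logits, canonical):
    """Baseline-style GoP: log posterior of the canonical phoneme."""
    return math.log(softmax(logits)[canonical])

def gop_maxlogit(logits, canonical):
    """Maxlogit-style score: canonical logit relative to the frame max,
    which avoids the softmax's overconfident squashing."""
    return logits[canonical] - max(logits)

def gop_margin(logits, canonical):
    """Margin-style score: canonical logit minus the best competitor."""
    competitors = [l for i, l in enumerate(logits) if i != canonical]
    return logits[canonical] - max(competitors)

frame = [1.2, 3.5, 0.4, 2.9]   # toy logits over 4 phonemes
print(gop_maxlogit(frame, canonical=1))  # 0.0: canonical is the argmax
print(round(gop_margin(frame, canonical=1), 4))
```

The point of the logit-based variants is that they do not force scores through a softmax that can be overconfident on acoustically atypical dysarthric speech.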
Automatic Pronunciation Assessment of Korean Spoken by L2 Learners Using Best Feature Set Selection
This paper proposes a method for automatic pronunciation assessment of Korean spoken by L2 learners by selecting the best feature set from a collection of the most well-known features in the literature. The L2 Korean Speech Corpus is used for assessment modeling; the native languages of the L2 learners are English, Chinese, Japanese, Russian, and Mongolian. In our system, learners' speech is forced-aligned and recognized using a native Korean acoustic model. Based on these results, various features for pronunciation assessment are computed and divided into four categories: RATE, SEGMENT, SILENCE, and GOP. Pronunciation scores produced by combining the feature categories with multiple linear regression are used as a baseline. To improve on the baseline, relevant features are selected using Principal Component Regression (PCR) and Best Subset Selection (BSS), respectively. The results show that the BSS model outperforms both the baseline and the PCR model, and that features corresponding to speech segments and rate are selected as the most relevant ones for automatic pronunciation assessment. The observed tendency of salient features will be useful for further improvement of automatic pronunciation assessment models for Korean language learners.
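The Best Subset Selection step can be sketched as an exhaustive search over feature subsets, each scored against the target pronunciation scores. In this hedged illustration the scorer is a stand-in correlation metric and the feature values are toy data, not the paper's regression pipeline or corpus:

```python
from itertools import combinations

# Hypothetical Best Subset Selection sketch: enumerate feature subsets
# up to a size limit and keep the subset with the best score.

def score_subset(features, subset, targets):
    """Toy score: mean absolute Pearson correlation between each selected
    feature column and the target scores (higher is better)."""
    if not subset:
        return 0.0
    total = 0.0
    for name in subset:
        col = features[name]
        mean_c = sum(col) / len(col)
        mean_t = sum(targets) / len(targets)
        cov = sum((c - mean_c) * (t - mean_t) for c, t in zip(col, targets))
        sd_c = sum((c - mean_c) ** 2 for c in col) ** 0.5
        sd_t = sum((t - mean_t) ** 2 for t in targets) ** 0.5
        total += abs(cov / (sd_c * sd_t)) if sd_c and sd_t else 0.0
    return total / len(subset)

def best_subset(features, targets, max_size=2):
    """Exhaustively search all subsets up to max_size features."""
    names = sorted(features)
    best = (0.0, ())
    for k in range(1, max_size + 1):
        for subset in combinations(names, k):
            best = max(best, (score_subset(features, subset, targets), subset))
    return best

features = {                        # made-up feature columns
    "RATE":    [1.0, 2.0, 3.0, 4.0],   # tracks the target closely
    "SILENCE": [0.3, 0.1, 0.4, 0.1],   # mostly noise
}
targets = [1.1, 2.1, 2.9, 4.2]      # made-up pronunciation scores
score, chosen = best_subset(features, targets)
print(chosen)
```

The exhaustive search is exponential in the number of features, which is why BSS is practical here only because the candidate feature pool is small; with larger pools, stepwise or regularized selection is the usual substitute.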
Analysis on Difference between Speaking Rates of Phoneme Classes and Oral Proficiency of Korean English Learners
Automatic Fluency Assessment of Korean Learners of English Using Articulation-based Phoneme-level Posterior Probabilities
Segmental Pronunciation Teaching Priorities for a Korean CAPT System: Focusing on the Variation Patterns of Chinese- and Japanese-speaking Learners
Assistive Program for Automatic Speech Transcription based on G2P conversion and Speech Recognition
Optimizing Vocabulary Modeling for Dysarthric Speech Recognition
Imperfect articulation in dysarthric speech degrades the performance of speech recognition. In this paper, the effect of the articulatory class of phonemes on dysarthric speech recognition results is analyzed using generalized linear mixed models (GLMMs). The analysis selects as best the model whose features are categorized by manner of articulation and place of the tongue. A recognition accuracy score for each word is then predicted from its pronunciation and the GLMM. The vocabulary optimized by selecting the words with the maximum scores shows a 16.4% relative error reduction in dysarthric speech recognition.
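The vocabulary-optimization step above amounts to scoring each candidate word from its phoneme makeup and keeping the highest-scoring words. The sketch below is a hedged stand-in: a simple additive per-class penalty replaces the fitted GLMM, and every weight and word is a made-up illustrative value.

```python
# Hypothetical per-class difficulty weights standing in for GLMM estimates.
ARTICULATION_PENALTY = {
    "fricative": 0.30,
    "affricate": 0.25,
    "stop":      0.10,
    "nasal":     0.05,
    "vowel":     0.02,
}

def predict_word_score(phoneme_classes):
    """Toy predicted recognition accuracy: 1 minus the mean per-phoneme
    penalty (unknown classes get a default penalty of 0.2)."""
    penalty = sum(ARTICULATION_PENALTY.get(c, 0.2) for c in phoneme_classes)
    return max(0.0, 1.0 - penalty / max(len(phoneme_classes), 1))

def optimize_vocabulary(candidates, size):
    """Keep the `size` candidate words with the highest predicted scores."""
    ranked = sorted(candidates,
                    key=lambda w: predict_word_score(candidates[w]),
                    reverse=True)
    return ranked[:size]

candidates = {   # made-up words with their phoneme articulation classes
    "mama": ["nasal", "vowel", "nasal", "vowel"],
    "fuzz": ["fricative", "vowel", "fricative"],
    "top":  ["stop", "vowel", "stop"],
}
print(optimize_vocabulary(candidates, size=2))
```

The design choice mirrors the abstract: rather than adapting the recognizer to the speech, the command vocabulary itself is shaped to avoid phoneme classes that dysarthric articulation renders unreliable.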